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Abstract 

Many communication channels with discrete input have non-uniform ca- 
pacity achieving probability mass functions (PMF). By parsing a stream of 
independent and equiprobable bits according to a full prefix-free code, a modu- 
lator can generate dyadic PMFs at the channel input. In this work, we show 
that for discrete memoryless channels and for memoryless discrete noiseless 
channels, searching for good dyadic input PMFs is equivalent to minimizing 
the Kullback-Leibler distance between a dyadic PMF and a weighted version 
of the capacity achieving PMF We define a new algorithm called Geometric 
Huffman Coding (GHC) and prove that GHC finds the optimal dyadic PMF in 
0(m\ogm) steps where m is the number of input symbols of the considered 
channel. Furthermore, we prove that by generating dyadic PMFs of blocks of 
consecutive input symbols, GHC achieves capacity when the block length goes 
to infinity. 

I. INTRODUCTION 

For many communication channels, the ultimate rate for reliable data transmission is 
given by the maximum information per cost. For discrete memoryless channels (DMC) 
and for additive noise channels with finite input alphabet, the ultimate rate is the maximum 
mutual information between input and output per channel use. For memoryless discrete 
noiseless channels (DNC), the ultimate rate is the maximum entropy of the input per 
average weight. In both cases, the maximum is achieved by an input that is distributed 
according to a capacity achieving probability mass function (PMF). To use non-uniform 
input PMFs in a digital communication system, a modulator has to generate this PMF by 
mapping independent equiprobable data bits to the channel input symbols. One way to 
do this is to parse the data bits by a full prefix-free code and to map each codeword to 
an input symbol [1, Sec. VII]. PMFs that can be generated in this way are dyadic, i.e., 
the probability of each point is of the form 2~ £ ,£ E N. The capacity achieving PMFs 
are in general not dyadic, which raises two questions. First, what is an optimal dyadic 
PMF that maximizes information per cost, and second, if we jointly generate blocks of 
consecutive input symbols by a dyadic PMF, can we asymptotically achieve capacity by 
letting the block length go to infinity. 

For noiseless channels, an efficient algorithm to find the optimal dyadic PMF that 
maximizes entropy per average weight was found in [2]. In general, a common approach 
in the literature is to use the dyadic PMF that results from the optimal source code of 
the capacity achieving PMF. Dyadic PMFs resulting from source codes are in general 
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not optimal. For the (d, k) constrained noiseless channel, it was claimed in pj that a 
source code asymptotically achieves capacity. To the best of our knowledge, for DMCs, 
there exist no results in the literature on optimality and asymptotic behavior of dyadic 
PMFs. In |[TJ, Q, the authors use source codes for additive noise channels. While good 
numerical results are observed, optimality and asymptotic behavior are not assessed. In 
[|5j, input entropy per average weight is maximized for additive noise channels. This is 
in general not equivalent to the maximization of mutual information per channel use. 

Denote the capacity achieving PMF of a channel by p*. In this work, we show for 
DMCs that minimizing the Kullback-Leibler distance (KL) D(p||p*) over all dyadic 
PMFs p maximizes a lower bound on the achieved mutual information per channel use. 
For DNCs, we show that searching for the optimal dyadic input PMF is equivalent 
to minimizing the weighted KL-distance D(p||p* R ) = '^2 i Pi^og(pi/p* R ) over all dyadic 
PMFs p. The value of R is given by the fraction of the channel capacity that is achievable 
by dyadic PMFs. We introduce an algorithm called Geometric Huffman Coding (Ghc) 
and prove that Ghc minimizes D(p||cc) over all dyadic PMFs p, for any given vector 
x with non-negative entries. In particular, for x = p*, Ghc minimizes D(p||p*) and for 
x = p* R , Ghc minimizes D(p\\p* R ). The complexity of Ghc is (9(mlogm), where m 
is the number of input symbols of the considered channel. Furthermore, we show that, 
to asymptotically achieve capacity for DMCs and DNCs, the normalized KL-distance 
D(p'*' \\pW*)/k has to vanish for block length k — > oo. This is achieved by Ghc. Based 
on the present work, we show in [[6| that for finite signal constellations with average 
power constraint, Ghc achieves capacity. Ghc is as handy as Huffman coding and an 
implementation of Ghc in MATLAB is readily available at our website [J7J. 

The remainder of this work is organized as follows. In Section [D] we define Ghc. In 
Section [TTTl we show optimality and asymptotic optimality of Ghc for DMCs. We show 



optimality and asymptotic optimality of Ghc for DNCs in Section IV 

II. Geometric Huffman Coding 
For a PMF p and a vector x with non-negative entries, the KL-distance is given by 

Pi 



B(p\\x) = \2 Pt log^. (1) 



Note that D(p||cc) can be equal to infinity. The dyadic PMF p that minimizes the KL- 
distance is directly given by the full prefix-free code that is constructed by the algorithm 
of the following proposition. A prefix-free code is full if it fulfills the Kraft inequality 
[8, Theorem 5.2.2] with equality. 

Proposition 1. Without loss of generality, we assume x\ > x<i > ■ ■ ■ > x m . The dyadic 
PMF p that minimizes D(p||cc) is obtained by constructing a Huffman tree with the 
updating rule 



i *^m— 1) If •Em—I — 4x m ^2>j 

li^J x m —\x mi if x m —\ <C Ax m . 



Since it involves a geometric mean, we call this method Geometric Huffman Coding. We 
write p = Ghc (a;). 



Proof: The proof is given in the appendix. 
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Fig. 1: For q = (0.328, 0.32, 0.22, 0.11, 0.022) T , the left figure displays the code tree of Ghc. The figure 
in the middle shows the code tree of Huffman coding. The right figure displays the code tree of 
Huffman coding applied to (qi, . . . , qi) T . 



An implementation of Ghc in MATLAB can be found at our website [|7J. In comparison 
to Ghc, Huffman coding uses the updating rule x' = x m + x m _i. Furthermore, it can be 
shown that Huffman coding minimizes the KL-distance D(cc||p) over all dyadic PMFs 
p. Note that this is not equivalent to minimizing ([T]) because the KL-distance is not 
symmetric in its arguments. Ghc has the same complexity as Huffman coding, which is 
0(m log m) [j9j, Chap. 16.3]. 

For illustration purpose, we apply Ghc and Huffman coding to the PMF 

q = (0.328, 0.32, 0.22, 0.11, 0.022) T (3) 

where (-) T denotes the transpose. The resulting code trees are displayed in Figure [TJ By 
reading off the codeword lengths, the corresponding dyadic PMFs are 

p GHC = (2-\ 2~ 2 , 2" 3 , 2" 3 , 0) T and p Hc = (2~ 2 , 2~ 2 , 2" 2 , 2' 3 , 2- 3 ) T (4) 

and the KL-distances to q are 

D(p GHC ||qr) = 0.13619 and D(p Hc \\q) = 0.19548 (5) 

where we used the dual logarithm. As expected, the KL-distance resulting from Ghc is 
smaller than the one that results from Huffman coding. Since Ghc assigns zero to #5, 
one may want to manually assign probability zero to q$ and then apply Huffman coding 
to (51, ... , g 4 ) T '. The corresponding code tree is displayed in Figure II The corresponding 
PMF and the resulting KL-distance to q are respectively 

p Hc ' = (2~ 2 ,2- 2 ,2- 2 ,2- 2 ,0) T , D(pHc'llg) = 0.15523. (6) 

While p Uc ' slightly improves upon p Bc , the KL-distance is still larger than the one 
resulting from Ghc. 

Let q now denote some arbitrary PMF We consider k subsequent symbols that are 
independent and identically distributed according to q. We denote the joint PMF of these 
symbols by q( k \ Our aim is to show that for p^ = Gnc(q^), the normalized KL- 
distance D(p( fe ) \\q^)/k vanishes for k — > 00. To show this, we will need the following 
lemma, which shows the existence of dyadic PMFs with a bounded KL-distance for any 
PMF q. 

Lemma 1. Without loss of generality, q\ > q 2 > ■ ■ ■ > q m . Assign then pi = 2 _ L- lo S2 9d 
for % < k, and Pi = for i > k, where k < m is chosen^ such that Y1T=\ Pi = ^ Then 



'it can actually be shown that such k always exists, so Gcc is well-defined. 



p is a dyadic PMF and D(p\\q) < log 2. We call this method Greedy Channel Coding 
(GccJ and write p = Gcc(q). 

Proof: 

™ p . ™ 2 -L- io g2<?i j 
V{p\\q) = } Vi\og— <} Pi\og (7) 



m 

<^ Pt log ^ : (8) 

i=i qi 
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= y)p i bg^ = log2 (9) 

tr ® 

where the inequality in |7]) follows from the values that Gcc assigns to the Pi. ■ 
An implementation of Gcc in MATLAB is available at [7]. It is now easy to show 
the asymptotic behavior of Ghc. 

Proposition 2. For p (fe) = Ghc(<? (/c) ) it holds that 

D(p^\\qM) k 



oo 



k 

Proof: Define p {k) = Gcc(<? (fc) ). Then 



0. (10) 



D(p( fc )||g( fc )) < D(pW\\q^) < log 2 

— ~ k 

where the first inequality follows via Proposition [T] from the optimality of Ghc and where 
the second inequality follows from Lemma [Tj log2/A; goes to zero for k — ► oo and the 
statement of the proposition follows. ■ 

III. Discrete Memoryless Channel 

We now show how Ghc can be used to find dyadic PMFs that well approximate the 
capacity of DMCs. A DMC is specified by a set of m input symbols, a set of n output 
symbols and a matrix of transition probabilities (hji). An input PMF p relates to its 
corresponding output PMF r as 

hu ■ ■ ■ hir 

(12) 
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The mutual information between input and output is given by JlOl Eq. (8.73)] 




The capacity of a DMC is the maximum mutual information between input and output, 
where the maximum is taken over all input PMFs. To find the best dyadic input PMF, 
we need to solve the optimization problem 

maximize I(p) 
p 

subject to p is a PMF 

p 1 = 2-'',^N, 1 = l,..,m. (14) 



This is a nonlinear optimization problem with integer constraints and therefore intractable 
for practical purposes. In order to overcome this difficulty, we proceed as follows. First, 
we will drop the restriction to dyadic PMFs and characterize the capacity achieving PMF 
p* . Then, we will derive the penalty that results from using a PMF p different from p* . 
Finally, we will minimize this penalty over all dyadic PMFs. 
Capacity and capacity achieving PMF are respectively defined as 

C = maxX(p), p* = argmaxX(p). (15) 
p P 

Denote by r and r* the output PMFs that result from using the input PMFs p and 
p*, respectively. According to [ fTTj Eq. (4.5.1)], the output PMF r* resulting from the 
capacity achieving PMF p* has the important property that 

hji log — = C, whenever p* > 0. (16) 



We now use this property to express the mutual information X(p) achieved by some PMF 
p in terms of capacity C and capacity achieving PMF p*. The only assumption that we 
make about p is that 

Pi = 0, whenever p* = 0. (17) 
Under this assumption, we have for X(p) 

Ap) = E ^ E h * lo s 7 1 = E Pi E h 3i lo s 7^ (! 8) 

i j 3 i j 3 3 

= I>E h * lo s ^ + E ^ E ^ lo s v. w 

4 n J o <i ^ 



c -E(E^) log ^ (20) 

j i 3 

c -E^ lo sS w 
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3 



C-D(r||r*) (22) 



where equality in ( |20| ) follows from ( |T6| ) and ( [T7] ). From the last line, we see that the 
penalty of using p instead of p* is exactly the KL-distance between the corresponding 
output PMFs r and r*. To get a simple expression that directly depends on p and p*, 
we lower bound the last line. According to (8j Eq. (4.45)] the KL-distance between 
the output PMFs is upper-bounded by the KL-distance between the input PMFs, i.e, 
D(r\\r*) < D(p\\p*). Thus, 

X(p)>C-D(p||p*). (23) 

We conclude that for DMCs, the penalty that results from using p instead of p* is upper 
bounded by D(p||p*). According to Proposition [Tj we can now efficiently minimize the 
penalty bound over all dyadic input PMFs p by using p = Ghc(p*). Note that Ghc 



guarantees ( T7J ): assume p* is ordered and p* m = 0, p^-i > 0. Then p* m _ x > Ap* m and 



Ghc assigns p m = 0. 



We now jointly consider the PMF of k consecutive channel inputs. Denote by p( fc )* 
the capacity achieving joint PMF. Since the channel is memoryless, p( fc )* is the product 
of k marginal PMFs p* and we have X(p^*) = kX(p*) = kC. Thus, for a joint PMF 
p( k ) we have 

Z(p (fe) ) >kC- D(p (fc) \\p (k) *). (24) 
The mutual information per channel use Z(pW) = X{p { ~ k ">)/k is thus given by 

I(p(*)) > C - U[P } P > . (25) 

k 

By using p^ = GHC(p^*), according to Proposition [2J X{p^) — > C for k — > 00 and 
we conclude that Ghc is asymptotically capacity achieving. 

IV. Memoryless Discrete Noiseless Channel 



Following [12 1, a memoryless DNC is given by a finite alphabet A = (ai, . . . , a m ) 
of atomic symbols and an associated weight function w : A — > M>o, a% <->■ Wi > 0. The 
information rate "H that is transmitted over the channel is given by the entropy of the 
input PMF divided by the average weight, i.e., 

U{p) 

T,iPi w i 



H(p) = , with H(p) = -^Pilogpi. (26) 



To find the dyadic PMF that maximizes %(p), we need to solve the optimization problem 

maximize "H (p) 
p 

subject to p is a PMF 

Pi = 2-4,4 eN,i = l,...,m. (27) 

As in the case of DMCs, this is an intractable nonlinear optimization problem with 
integer constraints. We will therefore proceed in the same way as we did for the DMC 



in Section III We start by calculating the capacity and the capacity achieving PMF p*. 
This can be done by Lagrange Multipliers, see, e.g., JT3J. Denote by b the base of the 
logarithm log. The capacity is achieved by the input PMF 

p* = 6- Ctti , i = l,...,m (28) 
where C denotes capacity and is given by the greatest positive real solution of the equation 



From ( |28| ), we have the relation W{ = — i logp*. We can thus write 

Pi w t = ~ £ Pi lo S P* ■ ( 30 ) 

i i 

Denote by R the fraction of C that can be achieved by the best dyadic PMF p, i.e., 

p = argmax'H(p), R = (31) 

p dyadic 



In general, R is not known beforehand, but we will show in Subsection IV-A how it can 
be found. Suppose for now that we know R. Assume further that 



Pi = 0, whenever p* = 0. (32) 

Furthermore, we use the convention OlogO = 0. With these assumptions, we can now 
write H(p) as 

n , p) = -RE^logP* + REiPilogPi +2M (33) 

i 

= RC _ EiP» log Pi - REiPitogp* (34) 

E Pi w i 
i 

= RC — = RC - if ; (35) 

EPi W i l^iPiWi 



where in ( |34| ), we used ( |30| ) and the definition of entropy. By ( |3Tj ), for the best dyadic PMF 
p we have %(p) = RC. It follows that for any dyadic PMF p, we have D(p||p* R ) > 
and for the best dyadic PMF p, we have D(p\\p* R ) = 0. We conclude that for DNCs, 
the best dyadic PMF is found by minimizing D(p||p* R ) over all dyadic PMFs p and 
by Proposition [Tj this PMF is given by p = Ghc(p* r ). Recall that, as we argued in 



Section Hm Ghc guarantees < |32| ) 



We now consider the PMF of k consecutive symbols. We denote the corresponding 
weights by w^ k \ The capacity achieving joint PMF is the product of k copies of p* 
and we denote it by p^*. Clearly, > kw min for % = l,...,m k where w min = 
mm{wi, . . . , w m }. Using this, we get for T-L{jp^) the lower bound 

- (fc)) = n(pW) = HfrW) + EiP? ] iggpg - TJt } iogp! fc> (36) 
Ei Pi k) Wi k) ~h Ei Pi k) lo s Pi k) * 

D(p( fe )||p( fe )*) 

= C ^ (fc) (fc) ^ 37 ^ 

>C-^ n D(P f j . (38) 

For p( fc ) = Ghc(j>^*), according to Proposition [2J the last term in the last line vanishes 
for k — > 00 and we have fl(p^ k ') — > C, thus Ghc is asymptotically capacity achieving. 

A. Finding R 

The exact value of R is in general not known beforehand. However, R and the best 
dyadic PMF can be found iteratively by the Lempel-Even-Cohn (Lec) algorithm pi. 
The idea of the algorithm is to start with some R, then find the best dyadic PMF for 
this R, and then update the value of R. The best PMF for a given R is in the original 
formulation of the Lec algorithm found as follows. A subset of £ nonzero entries of p is 
chosen. A Huffman- like procedure of complexity C(mlogm) then finds the best dyadic 
PMF with £ nonzero entries. There are m — 1 values for £ that have to be evaluated, the 
complexity of the overall procedure is thus roughly 0(m 2 logm). 



Algorithm 1 Finding R and the optimal dyadic PMF for DNCs 



i: procedure LEC(p*) 

2: R±- 1 

3: while D(p\\p* R ) ^ do 

4: p GHC^*^) 

5: r <- n(p)/c 

6: end while 

7: end procedure 



From ( [35] ) and a careful study of the original formulation in [|2j Sec. III,V], it can 



be shown that the iteration step is equivalent to minimizing the weighted KL-distance 
D(p\\p* R ). This can be done with complexity C(mlogm) by Ghc. A formulation of 
the complete Lec algorithm with Ghc as the iteration step is provided in Algorithm [T] 
Besides improving the complexity of the iteration step from C(m 2 logm) to O(mlogm), 
our formulation answers a question that was raised in p"4| , namely how the LEC algorithm 
could be used to find the dyadic PMF that minimizes the KL-distance D(p||p*). The 
simple answer is to perform the iteration step once with R — 1. An implementation in 
MATLAB of our formulation of the LEC algorithm can be found at Q. 

Appendix 

Denote by x some non-negative vector with m entries. Assume x is ordered, i.e., 
Xi > x 2 > ■ ■ ■ > x m . We now show that Ghc minimizes D(p||cc) over all dyadic PMFs 
p. The PMF p is dyadic if and only if there exist numbers £{ G N, i = 1, . . . , m, such 
that pi = 2~ £i ,Vi and ^2~^ = 1. This is equivalent to £i,...,£ m being the codeword 



lengths of a full prefix-free code [10 Sec. 2.3.2]. Using this, we can write 

D(p\\x) = Tp t log ^ = log(2) T Pi \og 2 ^ (39) 

% I 

= log(2) £ 2-4 (- log 2 x t - ii). (40) 

i 

We define u by Ui = — log 2 Xj,Vz. Omitting the constant factor log 2, our aim is thus to 
minimize 



Ui - h) (41) 



subject to £i, . . . ,£ m are the codeword lengths of a full prefix-free code. Based on ( |4T| ), 
we now prove the optimality of Ghc in a way similar to the proof given in [10, Sec. 
2.5.3] for the optimality of Huffman coding. 

Assume for now that an optimal algorithm assigns finite values to the codeword lengths 
£ m and £ m -i of the two least likely symbols, which correspond to the greatest entries u m 
and w m _i of u. We now show that in this case, there is an optimal algorithm for which 
£ — £ i 



Lemma 2. For an optimal algorithm, Ui > Uj implies £i > 



Proof: Assume the contrary, i.e., u i: > Uj and l± < £j. Consider the term 

2- e *(u i -t i )+2- t >(u j -£ j ). (42) 
By interchanging £{ and £j, the term decreases: 

[2- e i( Ui - £ 3 ) + 2- £i ( Uj - ii)] - [2- &i { Ui - ii) + 2-^{ Uj - lj)] (43) 
= T l * -Uj) + 2^ ( Uj - (44) 
= ( 2~ ei - 2- e *) (v,j - Ui) < (45) 

so any code with ui > Uj and < £ 3 - is not optimal. ■ 

Lemma 3. There is an optimal algorithm for which the codewords of the two greatest 
entries u m and u m _i are siblings, i.e., £ m = £ m -i, and in addition, no other codeword 
is longer than £ m and £ m -\. 

Proof: In a full prefix-free code, the sibling of the longest codeword is also a longest 
codeword. According to Lemma [2j if u m > u m _i > u m _ 2 > ■ ■ ■ , an optimal algorithm 
assigns the two longest codewords to u m and u m -\. If only u m > u m _i > u m _2 > • - • , 
assigning the two longest codewords to u m and u m -\ does not change optimality. ■ 
We can now use £ m = £ m -\ to rewrite ( |4Tj ): 

m m— 2 

(«» - U) = Yl ( Ui - ^ + 2 ~^ _1 1 - C-i) + 2"^ (w m - £ m ) (46) 

i=l i=l 

m-2 

= ^ 2"^(« i - £i) + 2- £ ™(w ro _ 1 + « m - 2£ m ) (47) 

i=l 
m-2 



£ 2~^ - £i) + 2-^-V [( ""-^ - l) - (C - 1) 



(48) 



m-2 



^2- < '(ii i -f j )+2-'(«'-0. (49) 



i=i 

Thus, by combining w m and w m _i through 

/ M m-1 "I" u m -, /cn\ 

u = 1 (50) 

the size m problem is reduced to a size m — 1 problem. 

The optimal algorithm may assign probability zero to the greatest entry u m , which 
corresponds to £ m = oo. We thus have 

m m— 1 m— 1 

- u) = Yl - £ <) + 2 ~°>* - °°) = J2 2 ~ 1 '^ - ^ (51) 

i=l i=l i=l 

where we used the convention — OlogO = and equivalently 2~°°oo = 0. Thus, if 
£ m = oo, the size m problem reduces to a size m — 1 problem. 

It remains to check if it is better to assign probability zero to u m or to combine u m 
and u m -\. First, assume the algorithm combines u m and u m -\- Then the contribution 



to the sum ( |4T| ) is 2 e> \u' — £'). We can now assign probability zero to u m and use the 
codeword of u' for w m _i. The contribution of u m to (j4Tb is then zero and the contribution 



of u m -i is 2 (u m -i — £'). Thus, since our aim is to minimize ( |4T| ), doing the former is 
better if and only if 

2~ 1 ' - t) > 2~ 1 ' (V - £') (52) 
* {um-i ~ > ( Um ~ l + Um - l) - 1' (53) 

* Um-l > Um ~^ Um ~ 1 (54) 
<^> w OT _! > w m - 2. (55) 



Recalling Xj = 2 "% the updating rule (|50]) and the condition ( [55] ) can be expressed 
in terms of tc as 

J ^m— 1) — (56) 

|^ 2^/ x m —\x mi if x m —\ <C Ax m . 
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